Project Brief

You are working as a data analyst for a yacht and boat sales website. The marketing team is preparing a weekly newsletter for boat owners. The newsletter is designed to help sellers to get more views of their boat, as well as stay on top of market trends. They would like you to take a look at the recent data and help them learn more about the characteristics of the most viewed boat listings in the last 7 days - is it the most expensive boats that get the most views? Are there common features among the most viewed boats?

Loading required libraries

library(data.table)
library(skimr)
library(tidyverse)
library(plotly)
library(priceR)

source("Functions.R")

Importing the data

dt <- fread("https://s3.amazonaws.com/talent-assets.datacamp.com/boat_data.csv")

Initial EDA

skim(dt)
Data summary
Name dt
Number of rows 9888
Number of columns 10
Key NULL
_______________________
Column type frequency:
character 6
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Price 0 1 7 12 0 3182 0
Boat Type 0 1 3 43 0 126 0
Manufacturer 0 1 0 42 1338 911 0
Type 0 1 0 28 6 25 0
Material 0 1 0 19 1749 12 0
Location 0 1 0 84 36 2841 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year Built 0 1.00 1893.19 460.20 0.00 1996.00 2007.00 2017.00 2021.00 ▁▁▁▁▇
Length 9 1.00 11.57 6.00 1.04 7.47 10.28 13.93 100.00 ▇▁▁▁▁
Width 56 0.99 3.52 1.22 0.01 2.54 3.33 4.25 25.16 ▇▁▁▁▁
Number of views last 7 days 0 1.00 149.16 151.82 13.00 70.00 108.00 172.00 3263.00 ▇▁▁▁▁

Few points to note :

dt$view_pct <- 100*dt$`Number of views last 7 days`/sum(dt$`Number of views last 7 days`)

dt <- dt %>% 
  separate(Price,c("Currency","Amount"),sep=" ",remove=TRUE,convert=TRUE)

dt <- dt %>% 
  separate(Type,c("Condition","Fuel"),sep=",",remove=TRUE,convert=TRUE)

dt[Condition=="",Condition:=NA]

dt[Condition=="Diesel" & is.na(Fuel),Fuel:="Diesel"]
dt[Condition=="Diesel" & Fuel=="Diesel",Condition:=NA]

dt[Condition=="Unleaded" & is.na(Fuel),Fuel:="Unleaded"]
dt[Condition=="Unleaded" & Fuel=="Unleaded",Condition:=NA]

dt[Condition=="Electric" & is.na(Fuel),Fuel:="Electric"]
dt[Condition=="Electric" & Fuel=="Electric",Condition:=NA]

dt[,Condition := as.factor(Condition)]
dt[,Fuel := as.factor(Fuel)]
dt[,Material := as.factor(Material)]

Now lets convert the Amount column to a standard currency amount, CAD in this case (based on the currency in “Currency” column and the average exchange rate of that currency to CAD from 1st Jan of this year till yesterday)

conversions <- sapply(unique(dt$Currency),avg_ex) %>% 
  data.frame() %>% 
  rownames_to_column() %>%
  `colnames<-`(c('Currency','conv'))

dt <- dt %>% 
  left_join(conversions,by='Currency') %>%
  mutate(Amount_CAD = Amount*conv) %>% 
  select(-c(14))
  

dt[,Currency := as.factor(Currency)]

Lets skim through our data once more -

skim(dt)
Data summary
Name dt
Number of rows 9888
Number of columns 14
Key NULL
_______________________
Column type frequency:
character 3
factor 4
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Boat Type 0 1 3 43 0 126 0
Manufacturer 0 1 0 42 1338 911 0
Location 0 1 0 84 36 2841 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Currency 0 1.00 FALSE 4 EUR: 8430, CHF: 980, £: 298, DKK: 180
Condition 86 0.99 FALSE 4 Use: 7327, new: 2084, new: 272, Dis: 119
Fuel 2212 0.78 FALSE 6 Die: 4568, Unl: 3040, Ele: 52, Gas: 13
Material 0 1.00 FALSE 12 GRP: 5484, emp: 1749, PVC: 1123, Ste: 939

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Amount 0 1.00 320137.34 1007482.24 3300.00 44000.00 95000.00 255000.00 31000000.00 ▇▁▁▁▁
Year Built 0 1.00 1893.19 460.20 0.00 1996.00 2007.00 2017.00 2021.00 ▁▁▁▁▇
Length 9 1.00 11.57 6.00 1.04 7.47 10.28 13.93 100.00 ▇▁▁▁▁
Width 56 0.99 3.52 1.22 0.01 2.54 3.33 4.25 25.16 ▇▁▁▁▁
Number of views last 7 days 0 1.00 149.16 151.82 13.00 70.00 108.00 172.00 3263.00 ▇▁▁▁▁
view_pct 0 1.00 0.01 0.01 0.00 0.00 0.01 0.01 0.22 ▇▁▁▁▁
Amount_CAD 0 1.00 444452.15 1427370.31 4571.60 63093.68 133971.92 364701.33 46145883.03 ▇▁▁▁▁

Analysis

Lets look at the log distribution of the “Number of views last 7 days” metric.

hist(log(dt$`Number of views last 7 days`))
abline(v=mean(log(dt$`Number of views last 7 days`)),col="red",lwd=3)

Above is a histogram of its log distribution. The red vertical line represents the mean of the distribution. The log mean is 4.73 which corresponds to 113.3. What is means is that on average, each listing gets about 113.3 views in 7 days.

Now lets look at the quantiles of the same, to get more idea about the distribution of views.

quantile(dt$`Number of views last 7 days`,c(.7,.8,.9,.95,.99))
##    70%    80%    90%    95%    99% 
## 154.00 196.00 283.00 389.00 773.65

As we can see, the top 10% listings in terms of number of views are the ones that have over 283 views in the last 7 days. The top 5% listings get over 389 views in 7 days. This shows that the data is heavily skewed towards lower values of “Number of views last 7 days”, with only a few listings getting much higher views.

Which boat types get most views?

Lets look at the proportion of views per boat type, top 10 boat types in terms of views -

dt[, .(views = sum(`Number of views last 7 days`),views_prop = sum(view_pct)),by=.(`Boat Type`)][order(-views)][1:10]
##               Boat Type  views views_prop
##  1:         Motor Yacht 351659  23.842872
##  2:          Sport Boat 251418  17.046421
##  3:          Cabin Boat 131700   8.929407
##  4:           Flybridge 127534   8.646947
##  5:             Trawler  75456   5.116001
##  6:          Pilothouse  74730   5.066777
##  7:             Hardtop  65343   4.430328
##  8:            Bowrider  47183   3.199060
##  9: Center console boat  37015   2.509658
## 10:        Fishing Boat  32434   2.199061
pie_plot(dt$`Boat Type`)

We can see that “Motor Yacht” boats are getting almost 24% of the total views in the last 7 days, followed by “Sport Boat” at 17% and “Cabin boat” & “Flybridge” at about 8-9% each.

Motor Yachts seem to be the most popular type of boats.

Which Manufacturers boats are getting most views?

Lets look at the proportion of views per Manufacturer, top 10 in terms of views -

dt[Manufacturer!="", .(views = sum(`Number of views last 7 days`),views_prop = sum(view_pct)),by=.(Manufacturer)][order(-views)][1:10]
##               Manufacturer views views_prop
##  1: Bénéteau power boats 56211   3.811168
##  2:  Sunseeker power boats 51110   3.465315
##  3:   Jeanneau power boats 51070   3.462603
##  4:    Sea Ray power boats 45387   3.077289
##  5:   Princess power boats 39051   2.647701
##  6:    Bavaria power boats 35569   2.411618
##  7:   Bayliner power boats 34555   2.342868
##  8:    Cranchi power boats 32442   2.199604
##  9:     Azimut power boats 24339   1.650211
## 10:   Fairline power boats 24276   1.645940
pie_plot(dt$Manufacturer)

Here we see that for a lot of listings, which constitue about 11% of the total views, the Manufacturer is not mentioned. Nonetheless, the most popular manufacturers are -

  • Bénéteau power boats with almost 4% view share
  • Sunseeker power boats , Jeanneau power boats with 3.5% view share
  • Sea Ray power boats with about 3.1% view share

Which year built is getting most views?

dt[, .(views = sum(`Number of views last 7 days`),views_prop = sum(view_pct)),by=.(`Year Built`)][order(-views)][1:10]
##     Year Built  views views_prop
##  1:       2020 175872  11.924318
##  2:       2019  88438   5.996195
##  3:          0  72302   4.902156
##  4:       2006  52614   3.567288
##  5:       2008  51351   3.481655
##  6:       2007  48392   3.281032
##  7:       2017  42420   2.876123
##  8:       2018  42030   2.849681
##  9:       2005  41354   2.803847
## 10:       2004  37143   2.518337
pie_plot(dt$`Year Built`)

Boats from the year 2020 get the most views, about 12% of all views in the last 7 days were of boats built in 2020.

Which Currency is most viewed?

dt[, .(views = sum(`Number of views last 7 days`),views_prop = sum(view_pct)),by=.(Currency)][order(-views)]
##    Currency   views views_prop
## 1:      EUR 1075324  72.908166
## 2:      CHF  347993  23.594313
## 3:       £   28208   1.912534
## 4:      DKK   23377   1.584987
pie_plot(dt$Currency)

EUR and CHF currencies are most looked at, with the other 2 being minority.

Are more expensive boats viewed more?

dtx <- log(dt$Amount_CAD)
dty <- dt$`Number of views last 7 days`
h <- hist(log(dt$Amount_CAD))

breaks <- data.frame("beg"=h$breaks[-length(h$breaks)],"end"=h$breaks[-1])
sums <- apply(breaks, MARGIN=1, FUN=function(x) { sum(dty[dtx >= x[1] & dtx < x[2] ]) })
h$counts <- sums
mean <- mean(log(dt$Amount_CAD))
median <- median(log(dt$Amount_CAD))
plot(h, ylab="Total Views (L7D)", main="Sum of L7D views Within Bins")
abline(v=mean,col="red",lwd=3)
abline(v=median,col="blue",lwd=3)

Most views are ones where the price is close to the mean of the distribution. Thus, no - the boats that are more expensive do not get more views. Most views are at a price range of around CAD 154218.

Conclusion -

After going through the data, the following conclusions can be made -